In [12]:
# helloMachine.r
# MWL, Lecture 1
# Author(s): [Phil Snyder]

In [13]:
data(iris) # load data (dataset "iris" comes with your R installation)

In [14]:
"
R doesn't support multi-line comments, but we can get away with 
benignly passing it a string instead.
"


Out[14]:
" R doesn't support multi-line comments, but we can get away with benignly passing it a string instead. "

In [15]:
iris


Out[15]:
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
15.13.51.40.2setosa
24.931.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
553.61.40.2setosa
65.43.91.70.4setosa
74.63.41.40.3setosa
853.41.50.2setosa
94.42.91.40.2setosa
104.93.11.50.1setosa
115.43.71.50.2setosa
124.83.41.60.2setosa
134.831.40.1setosa
144.331.10.1setosa
155.841.20.2setosa
165.74.41.50.4setosa
175.43.91.30.4setosa
185.13.51.40.3setosa
195.73.81.70.3setosa
205.13.81.50.3setosa
215.43.41.70.2setosa
225.13.71.50.4setosa
234.63.610.2setosa
245.13.31.70.5setosa
254.83.41.90.2setosa
26531.60.2setosa
2753.41.60.4setosa
285.23.51.50.2setosa
295.23.41.40.2setosa
304.73.21.60.2setosa
314.83.11.60.2setosa
325.43.41.50.4setosa
335.24.11.50.1setosa
345.54.21.40.2setosa
354.93.11.50.2setosa
3653.21.20.2setosa
375.53.51.30.2setosa
384.93.61.40.1setosa
394.431.30.2setosa
405.13.41.50.2setosa
4153.51.30.3setosa
424.52.31.30.3setosa
434.43.21.30.2setosa
4453.51.60.6setosa
455.13.81.90.4setosa
464.831.40.3setosa
475.13.81.60.2setosa
484.63.21.40.2setosa
495.33.71.50.2setosa
5053.31.40.2setosa
5173.24.71.4versicolor
526.43.24.51.5versicolor
536.93.14.91.5versicolor
545.52.341.3versicolor
556.52.84.61.5versicolor
565.72.84.51.3versicolor
576.33.34.71.6versicolor
584.92.43.31versicolor
596.62.94.61.3versicolor
605.22.73.91.4versicolor
61523.51versicolor
625.934.21.5versicolor
6362.241versicolor
646.12.94.71.4versicolor
655.62.93.61.3versicolor
666.73.14.41.4versicolor
675.634.51.5versicolor
685.82.74.11versicolor
696.22.24.51.5versicolor
705.62.53.91.1versicolor
715.93.24.81.8versicolor
726.12.841.3versicolor
736.32.54.91.5versicolor
746.12.84.71.2versicolor
756.42.94.31.3versicolor
766.634.41.4versicolor
776.82.84.81.4versicolor
786.7351.7versicolor
7962.94.51.5versicolor
805.72.63.51versicolor
815.52.43.81.1versicolor
825.52.43.71versicolor
835.82.73.91.2versicolor
8462.75.11.6versicolor
855.434.51.5versicolor
8663.44.51.6versicolor
876.73.14.71.5versicolor
886.32.34.41.3versicolor
895.634.11.3versicolor
905.52.541.3versicolor
915.52.64.41.2versicolor
926.134.61.4versicolor
935.82.641.2versicolor
9452.33.31versicolor
955.62.74.21.3versicolor
965.734.21.2versicolor
975.72.94.21.3versicolor
986.22.94.31.3versicolor
995.12.531.1versicolor
1005.72.84.11.3versicolor
1016.33.362.5virginica
1025.82.75.11.9virginica
1037.135.92.1virginica
1046.32.95.61.8virginica
1056.535.82.2virginica
1067.636.62.1virginica
1074.92.54.51.7virginica
1087.32.96.31.8virginica
1096.72.55.81.8virginica
1107.23.66.12.5virginica
1116.53.25.12virginica
1126.42.75.31.9virginica
1136.835.52.1virginica
1145.72.552virginica
1155.82.85.12.4virginica
1166.43.25.32.3virginica
1176.535.51.8virginica
1187.73.86.72.2virginica
1197.72.66.92.3virginica
12062.251.5virginica
1216.93.25.72.3virginica
1225.62.84.92virginica
1237.72.86.72virginica
1246.32.74.91.8virginica
1256.73.35.72.1virginica
1267.23.261.8virginica
1276.22.84.81.8virginica
1286.134.91.8virginica
1296.42.85.62.1virginica
1307.235.81.6virginica
1317.42.86.11.9virginica
1327.93.86.42virginica
1336.42.85.62.2virginica
1346.32.85.11.5virginica
1356.12.65.61.4virginica
1367.736.12.3virginica
1376.33.45.62.4virginica
1386.43.15.51.8virginica
139634.81.8virginica
1406.93.15.42.1virginica
1416.73.15.62.4virginica
1426.93.15.12.3virginica
1435.82.75.11.9virginica
1446.83.25.92.3virginica
1456.73.35.72.5virginica
1466.735.22.3virginica
1476.32.551.9virginica
1486.535.22virginica
1496.23.45.42.3virginica
1505.935.11.8virginica

In [16]:
help(topic='iris', package='datasets')


Out[16]:
iris {datasets}R Documentation

Edgar Anderson's Iris Data

Description

This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Usage

iris
iris3

Format

iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

iris3 gives the same data arranged as a 3-dimensional array of size 50 by 4 by 3, as represented by S-PLUS. The first dimension gives the case number within the species subsample, the second the measurements with names Sepal L., Sepal W., Petal L., and Petal W., and the third the species.

Source

Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, Part II, 179–188.

The data were collected by Anderson, Edgar (1935). The irises of the Gaspe Peninsula, Bulletin of the American Iris Society, 59, 2–5.

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole. (has iris3 as iris.)

See Also

matplot some examples of which use iris.

Examples

dni3 <- dimnames(iris3)
ii <- data.frame(matrix(aperm(iris3, c(1,3,2)), ncol = 4,
                        dimnames = list(NULL, sub(" L.",".Length",
                                        sub(" W.",".Width", dni3[[2]])))),
    Species = gl(3, 50, labels = sub("S", "s", sub("V", "v", dni3[[3]]))))
all.equal(ii, iris) # TRUE

[Package datasets version 3.2.2 ]

In [17]:
names(iris)


Out[17]:
  1. 'Sepal.Length'
  2. 'Sepal.Width'
  3. 'Petal.Length'
  4. 'Petal.Width'
  5. 'Species'

In [18]:
class(iris)


Out[18]:
'data.frame'

In [19]:
str(iris) # str := structure


'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

In [20]:
levels(iris$Species)


Out[20]:
  1. 'setosa'
  2. 'versicolor'
  3. 'virginica'

In [21]:
?levels


Out[21]:
levels {base}R Documentation

Levels Attributes

Description

levels provides access to the levels attribute of a variable. The first form returns the value of the levels of its argument and the second sets the attribute.

Usage

levels(x)
levels(x) <- value

Arguments

x

an object, for example a factor.

value

A valid value for levels(x). For the default method, NULL or a character vector. For the factor method, a vector of character strings with length at least the number of levels of x, or a named list specifying how to rename the levels.

Details

Both the extractor and replacement forms are generic and new methods can be written for them. The most important method for the replacement function is that for factors.

For the factor replacement method, a NA in value causes that level to be removed from the levels and the elements formerly with that level to be replaced by NA.

Note that for a factor, replacing the levels via levels(x) <- value is not the same as (and is preferred to) attr(x, "levels") <- value.

The replacement function is primitive.

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

See Also

nlevels, relevel, reorder.

Examples

## assign individual levels
x <- gl(2, 4, 8)
levels(x)[1] <- "low"
levels(x)[2] <- "high"
x

## or as a group
y <- gl(2, 4, 8)
levels(y) <- c("low", "high")
y

## combine some levels
z <- gl(3, 2, 12)
levels(z) <- c("A", "B", "A")
z

## same, using a named list
z <- gl(3, 2, 12)
levels(z) <- list(A = c(1,3), B = 2)
z

## we can add levels this way:
f <- factor(c("a","b"))
levels(f) <- c("c", "a", "b")
f

f <- factor(c("a","b"))
levels(f) <- list(C = "C", A = "a", B = "b")
f

[Package base version 3.2.2 ]

In [22]:
nrow(iris)


Out[22]:
150

In [23]:
ncol(iris)


Out[23]:
5

In [24]:
iris[1:3,] # pay attention to where the comma is!


Out[24]:
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
15.13.51.40.2setosa
24.931.40.2setosa
34.73.21.30.2setosa

In [25]:
iris[,1:3]


Out[25]:
Sepal.LengthSepal.WidthPetal.Length
15.13.51.4
24.931.4
34.73.21.3
44.63.11.5
553.61.4
65.43.91.7
74.63.41.4
853.41.5
94.42.91.4
104.93.11.5
115.43.71.5
124.83.41.6
134.831.4
144.331.1
155.841.2
165.74.41.5
175.43.91.3
185.13.51.4
195.73.81.7
205.13.81.5
215.43.41.7
225.13.71.5
234.63.61
245.13.31.7
254.83.41.9
26531.6
2753.41.6
285.23.51.5
295.23.41.4
304.73.21.6
314.83.11.6
325.43.41.5
335.24.11.5
345.54.21.4
354.93.11.5
3653.21.2
375.53.51.3
384.93.61.4
394.431.3
405.13.41.5
4153.51.3
424.52.31.3
434.43.21.3
4453.51.6
455.13.81.9
464.831.4
475.13.81.6
484.63.21.4
495.33.71.5
5053.31.4
5173.24.7
526.43.24.5
536.93.14.9
545.52.34
556.52.84.6
565.72.84.5
576.33.34.7
584.92.43.3
596.62.94.6
605.22.73.9
61523.5
625.934.2
6362.24
646.12.94.7
655.62.93.6
666.73.14.4
675.634.5
685.82.74.1
696.22.24.5
705.62.53.9
715.93.24.8
726.12.84
736.32.54.9
746.12.84.7
756.42.94.3
766.634.4
776.82.84.8
786.735
7962.94.5
805.72.63.5
815.52.43.8
825.52.43.7
835.82.73.9
8462.75.1
855.434.5
8663.44.5
876.73.14.7
886.32.34.4
895.634.1
905.52.54
915.52.64.4
926.134.6
935.82.64
9452.33.3
955.62.74.2
965.734.2
975.72.94.2
986.22.94.3
995.12.53
1005.72.84.1
1016.33.36
1025.82.75.1
1037.135.9
1046.32.95.6
1056.535.8
1067.636.6
1074.92.54.5
1087.32.96.3
1096.72.55.8
1107.23.66.1
1116.53.25.1
1126.42.75.3
1136.835.5
1145.72.55
1155.82.85.1
1166.43.25.3
1176.535.5
1187.73.86.7
1197.72.66.9
12062.25
1216.93.25.7
1225.62.84.9
1237.72.86.7
1246.32.74.9
1256.73.35.7
1267.23.26
1276.22.84.8
1286.134.9
1296.42.85.6
1307.235.8
1317.42.86.1
1327.93.86.4
1336.42.85.6
1346.32.85.1
1356.12.65.6
1367.736.1
1376.33.45.6
1386.43.15.5
139634.8
1406.93.15.4
1416.73.15.6
1426.93.15.1
1435.82.75.1
1446.83.25.9
1456.73.35.7
1466.735.2
1476.32.55
1486.535.2
1496.23.45.4
1505.935.1

In [26]:
iris$Sepal.Length


Out[26]:
  1. 5.1
  2. 4.9
  3. 4.7
  4. 4.6
  5. 5
  6. 5.4
  7. 4.6
  8. 5
  9. 4.4
  10. 4.9
  11. 5.4
  12. 4.8
  13. 4.8
  14. 4.3
  15. 5.8
  16. 5.7
  17. 5.4
  18. 5.1
  19. 5.7
  20. 5.1
  21. 5.4
  22. 5.1
  23. 4.6
  24. 5.1
  25. 4.8
  26. 5
  27. 5
  28. 5.2
  29. 5.2
  30. 4.7
  31. 4.8
  32. 5.4
  33. 5.2
  34. 5.5
  35. 4.9
  36. 5
  37. 5.5
  38. 4.9
  39. 4.4
  40. 5.1
  41. 5
  42. 4.5
  43. 4.4
  44. 5
  45. 5.1
  46. 4.8
  47. 5.1
  48. 4.6
  49. 5.3
  50. 5
  51. 7
  52. 6.4
  53. 6.9
  54. 5.5
  55. 6.5
  56. 5.7
  57. 6.3
  58. 4.9
  59. 6.6
  60. 5.2
  61. 5
  62. 5.9
  63. 6
  64. 6.1
  65. 5.6
  66. 6.7
  67. 5.6
  68. 5.8
  69. 6.2
  70. 5.6
  71. 5.9
  72. 6.1
  73. 6.3
  74. 6.1
  75. 6.4
  76. 6.6
  77. 6.8
  78. 6.7
  79. 6
  80. 5.7
  81. 5.5
  82. 5.5
  83. 5.8
  84. 6
  85. 5.4
  86. 6
  87. 6.7
  88. 6.3
  89. 5.6
  90. 5.5
  91. 5.5
  92. 6.1
  93. 5.8
  94. 5
  95. 5.6
  96. 5.7
  97. 5.7
  98. 6.2
  99. 5.1
  100. 5.7
  101. 6.3
  102. 5.8
  103. 7.1
  104. 6.3
  105. 6.5
  106. 7.6
  107. 4.9
  108. 7.3
  109. 6.7
  110. 7.2
  111. 6.5
  112. 6.4
  113. 6.8
  114. 5.7
  115. 5.8
  116. 6.4
  117. 6.5
  118. 7.7
  119. 7.7
  120. 6
  121. 6.9
  122. 5.6
  123. 7.7
  124. 6.3
  125. 6.7
  126. 7.2
  127. 6.2
  128. 6.1
  129. 6.4
  130. 7.2
  131. 7.4
  132. 7.9
  133. 6.4
  134. 6.3
  135. 6.1
  136. 7.7
  137. 6.3
  138. 6.4
  139. 6
  140. 6.9
  141. 6.7
  142. 6.9
  143. 5.8
  144. 6.8
  145. 6.7
  146. 6.7
  147. 6.3
  148. 6.5
  149. 6.2
  150. 5.9

In [27]:
iris[,'Sepal.Length']


Out[27]:
  1. 5.1
  2. 4.9
  3. 4.7
  4. 4.6
  5. 5
  6. 5.4
  7. 4.6
  8. 5
  9. 4.4
  10. 4.9
  11. 5.4
  12. 4.8
  13. 4.8
  14. 4.3
  15. 5.8
  16. 5.7
  17. 5.4
  18. 5.1
  19. 5.7
  20. 5.1
  21. 5.4
  22. 5.1
  23. 4.6
  24. 5.1
  25. 4.8
  26. 5
  27. 5
  28. 5.2
  29. 5.2
  30. 4.7
  31. 4.8
  32. 5.4
  33. 5.2
  34. 5.5
  35. 4.9
  36. 5
  37. 5.5
  38. 4.9
  39. 4.4
  40. 5.1
  41. 5
  42. 4.5
  43. 4.4
  44. 5
  45. 5.1
  46. 4.8
  47. 5.1
  48. 4.6
  49. 5.3
  50. 5
  51. 7
  52. 6.4
  53. 6.9
  54. 5.5
  55. 6.5
  56. 5.7
  57. 6.3
  58. 4.9
  59. 6.6
  60. 5.2
  61. 5
  62. 5.9
  63. 6
  64. 6.1
  65. 5.6
  66. 6.7
  67. 5.6
  68. 5.8
  69. 6.2
  70. 5.6
  71. 5.9
  72. 6.1
  73. 6.3
  74. 6.1
  75. 6.4
  76. 6.6
  77. 6.8
  78. 6.7
  79. 6
  80. 5.7
  81. 5.5
  82. 5.5
  83. 5.8
  84. 6
  85. 5.4
  86. 6
  87. 6.7
  88. 6.3
  89. 5.6
  90. 5.5
  91. 5.5
  92. 6.1
  93. 5.8
  94. 5
  95. 5.6
  96. 5.7
  97. 5.7
  98. 6.2
  99. 5.1
  100. 5.7
  101. 6.3
  102. 5.8
  103. 7.1
  104. 6.3
  105. 6.5
  106. 7.6
  107. 4.9
  108. 7.3
  109. 6.7
  110. 7.2
  111. 6.5
  112. 6.4
  113. 6.8
  114. 5.7
  115. 5.8
  116. 6.4
  117. 6.5
  118. 7.7
  119. 7.7
  120. 6
  121. 6.9
  122. 5.6
  123. 7.7
  124. 6.3
  125. 6.7
  126. 7.2
  127. 6.2
  128. 6.1
  129. 6.4
  130. 7.2
  131. 7.4
  132. 7.9
  133. 6.4
  134. 6.3
  135. 6.1
  136. 7.7
  137. 6.3
  138. 6.4
  139. 6
  140. 6.9
  141. 6.7
  142. 6.9
  143. 5.8
  144. 6.8
  145. 6.7
  146. 6.7
  147. 6.3
  148. 6.5
  149. 6.2
  150. 5.9

In [60]:
setosa <- iris[iris$Species == "setosa",] # get rows from iris where Species == "setosa"
linearModel <- lm(formula = Sepal.Length ~ Sepal.Width, data = setosa) # ~ := "as explained by"

In [61]:
linearModel


Out[61]:
Call:
lm(formula = Sepal.Length ~ Sepal.Width, data = setosa)

Coefficients:
(Intercept)  Sepal.Width  
     2.6390       0.6905  

In [62]:
names(linearModel)


Out[62]:
  1. 'coefficients'
  2. 'residuals'
  3. 'effects'
  4. 'rank'
  5. 'fitted.values'
  6. 'assign'
  7. 'qr'
  8. 'df.residual'
  9. 'xlevels'
  10. 'call'
  11. 'terms'
  12. 'model'

In [63]:
linearModel$call


Out[63]:
lm(formula = Sepal.Length ~ Sepal.Width, data = setosa)

In [64]:
plot(formula = Sepal.Length ~ Sepal.Width, data = setosa)
abline(linearModel)



In [65]:
# sapply is like a map function
squaredResiduals <- sapply(linearModel$residuals, function(x) return(x^2)) 
squaredError <- sum(squaredResiduals) / length(squaredResiduals)
squaredError


Out[65]:
0.0546263038291104

In [66]:
# What if we remove the potential outlier at (2.3, 4.5)?
setosa[42,]


Out[66]:
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
424.52.31.30.3setosa

In [81]:
setosaMinusOutlier <- setosa[-42,]
fixedLinearModel <- lm(Sepal.Length ~ Sepal.Width, setosaMinusOutlier)
plot(Sepal.Length ~ Sepal.Width, setosaMinusOutlier)
abline(fixedLinearModel)



In [82]:
fixedSquaredResiduals <- sapply(fixedLinearModel$residuals, function(x) return(x^2)) 
fixedSquaredError <- sum(fixedSquaredResiduals) / length(fixedSquaredResiduals)
fixedSquaredError


Out[82]:
0.0538399520729466

Not much improvement in this case.

When is removing outliers justified? What if we had seen a significant decrease in the squared error? Is Sepal.Length ~ Sepal.Width a smart thing to model, or are there hidden variables? (i.e., mean amount of daily sunshine, precipitation levels, exposure to wind, ...).


In [83]:
# now for some classification
flowers <- subset(iris, Species == "setosa" | Species == "virginica")
# equivalently flowers <- iris[iris$Species == "setosa" | iris$Species == "virginica",]
# R has different "OR" operators depending on whether we are doing operations 
# on matrix like objects or atomic objects. Be sure to use a single bar | here.

In [84]:
# Here our data has two classes, hence we are performing "binary classification".
plot(flowers[,1:4], pch=sapply(flowers$Species, substr, 1, 1))



In [85]:
flowers$Species <- as.factor(flowers$Species)

In this case, this is unnecessary because class(flowers$Species) == factor already, but in general we need to make our categorical response variables factors for classification.


In [86]:
flowers <- droplevels(flowers) # need to drop "versicolor" from Species levels since it's empty
dotchart(flowers$Petal.Length, pch=sapply(flowers$Species, substr, 1, 1))


Notice how the data is linearly separable. (In dot charts the y-axis is meaningless).


In [87]:
# 1-d case
library(MASS)
oneLinearClass <- lda(Species ~ Petal.Length, flowers)

lda stands for Linear Discriminant Analysis. How it works is not important for now. Just know that it attempts to draw a dot/line/plane/hyperplane separating our two classes.


In [88]:
onePredictions <- predict(oneLinearClass, flowers)
onePredictionResults <- table(onePredictions$class == flowers$Species) / length(flowers$Species)
onePredictionResults


Out[88]:
TRUE 
   1 

Wow! 0 error! Who would've thunk.


In [89]:
# 2-d case
plot(Sepal.Length ~ Sepal.Width, flowers, pch=sapply(flowers$Species, substr, 1, 1))



In [90]:
twoLinearClass <- lda(Species ~ Sepal.Length + Sepal.Width, flowers)
twoPredictions <- predict(twoLinearClass, flowers)
twoPredictionResults <- table(twoPredictions$class == flowers$Species) / length(flowers$Species)
twoPredictionResults


Out[90]:
FALSE  TRUE 
 0.01  0.99 

Here LDA is not perfect, although our data is (barely) linearly separable. This is because we abused the LDA assumption that the two classes have the same covariance matrix. (More on this in later lectures).

Now let's try the entire iris dataset, including Species == "versicolor" in n-dimensions. n = 4 here, since we have 4 features (we don't include our "Species" label).


In [91]:
labelMapper <- function(s) { # this is just to help us plot the data in the next line
    if (s == "setosa") return(1)
    if (s == "versicolor") return(2)
    if (s == "virginica") return(3)
}
plot(iris[,1:4], pch=sapply(iris$Species, labelMapper))



In [92]:
nLinearClass <- lda(Species ~ ., iris) # Species ~ . := Species as explained by "everything"
nPredictions <- predict(nLinearClass, iris)
nPredictionResults <- table(nPredictions$class == iris$Species) / length(iris$Species)
nPredictionResults


Out[92]:
FALSE  TRUE 
 0.02  0.98 

This is actually pretty good, especially since all we're doing is drawing a hyperplane through our 4 dimensional data!

BUT, as we'll see next lecture, we have done something egregious that has given us a false sense of confidence about the accuracy of our predictor...